Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Sun Nov 26 15:23:03 2023"
The text continues here.
e.g., How are you feeling right now? What do you expect to learn? Where did you hear about the course?
As an R beginner, I feel this course is very friendly. Both the teacher's instructions and the book are clear enough for me to understand and move on easily. The learning curve looks smooth and the instructions are encouraging; all of this gives a beginner confidence. I have been impressed by how versatile and powerful R is, so I want to add it to my research toolbox, even though my research does not currently involve it (it may in the future once I have the skill). In other words, learning R is an interest or hobby for me at the moment, and it is always good to learn something fresh in my spare time; R is the latest example! I heard about this course from a colleague who regularly produces impressive graphs from statistical data. He became very skillful with R after a year of learning that began with this very course, and his feedback on the course and his learning outcomes really encouraged me to get into R the same way.
How did it work as a “crash course” on modern R tools and using RStudio? Which were your favorite topics? Which topics were most difficult? Some other comments on the book and our new approach of getting started with R Markdown etc.?
This book is clear, instructive, and encouraging for an R beginner. I enjoyed the data wrangling and visualization parts, which are accessible and easy to read. My favorite feature of R is the instant feedback after executing a command, whether right or wrong; this is something a researcher is eager for but seldom gets in academic work, even after considerable effort. However, I found the data analysis part challenging, because interpreting the R output requires specialized knowledge of statistics: it can be easy to run the analysis functions but hard to understand the meaning of the results. There is therefore more to learn as I move on to data analysis in R. The book provides plenty of clear instructions to work through and practice, which I think is quite enough for getting started with R; I just need to put in the time.
Following the detailed instructions in “Start me up”, I set up the technical platforms needed in this course, including R, RStudio, and GitHub. Following the teacher's instructions, I learned how to push updated scripts to GitHub. I then started learning R with the book “R for Health Data Science” and the corresponding Exercise 1. This week I finished Chapters 1-5 of the book, which was a heavy and time-consuming task, but a rewarding one! In addition, I read the recommended Part 1 of the textbook “MABS4IODS” and watched the two webinars on RStudio and GitHub.
Describe the work you have done this week and summarize your learning.
date()
## [1] "Sun Nov 26 15:23:03 2023"
Data analysis assignment
Read the students2014 data into R and explore the structure and the dimensions of the data
students2014 <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt", sep=",", header=TRUE)
dim(students2014)
## [1] 166 7
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
Description: The data set consists of 166 observations of 7 variables. It records each student's gender, age, attitude, and exam points, together with their responses to questions on deep, surface, and strategic learning.
Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# For making a nicer plot
my_lower <- function(data, mapping, ...) {
ggplot(data = data, mapping=mapping) +
geom_point(..., alpha=1, pch = 21, color ="black") +
theme(
panel.background = element_rect(fill = "white", color = "black")
)
}
my_diag <- function(data, mapping, ...) {
ggplot(data = data, mapping=mapping) +
geom_density(..., alpha=0.7, color ="black") +
theme(
panel.background = element_rect(fill = "white", color = "black")
)
}
my_discrete <- function(data, mapping, ...) {
ggplot(data = data, mapping=mapping) +
geom_bar(..., alpha=1, color ="black") +
theme(
panel.background = element_rect(fill = "white", color = "black")
)
}
my_upper_combo<- function(data, mapping, ...) {
ggplot(data = data, mapping=mapping) +
geom_boxplot(...,alpha=1, color ="black") +
geom_jitter(...,alpha = 0.5, width = 0.25, size = 1, pch = 21, color = "black")+
theme(
panel.background = element_rect(fill = "white", color = "black")
)
}
# Plot
p <- ggpairs(students2014,
mapping = aes(col = gender, fill = gender, alpha = 0.7),
lower = list(
continuous = my_lower,
combo = wrap("facethist", bins = 20, color = "black")
),
diag = list(
continuous = my_diag,
discrete = my_discrete
),
upper = list(
combo = my_upper_combo,
continuous = wrap("cor", color = "black")
)
)+
scale_fill_manual(values = c("lightblue", "wheat"))+
scale_color_manual(values = c("lightblue", "wheat"))
# Print the plot
p
# Summaries of the variables in the data
summary(students2014)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
Description: Most of the numeric variables are roughly normally distributed, except for age, which is concentrated in the early twenties. Females outnumber males about two to one. Among the variables, attitude towards statistics is the most strongly correlated with exam points (r = 0.437).
Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent, outcome) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it.
# Choose attitude, stra, and surf as the three explanatory variables because they have the highest (absolute) correlations with the target variable (exam points), according to the plot matrix.
model1 <- lm(points ~ attitude + stra + surf, data = students2014)
# Print out the summary of the first model
summary(model1)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
# Remove the non-significant variables and fit the regression model again
model2 <- lm(points ~ attitude, data = students2014)
# Print out the summary of the fitted model
summary(model2)
##
## Call:
## lm(formula = points ~ attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Description: “attitude” was the only significant predictor in the model; the other two variables entered, “stra” and “surf”, had p-values of 0.12 and 0.47, respectively. The model was therefore re-fitted with “attitude” alone, producing a final model that explains 19.06% (R-squared = 0.1906) of the variance in the response (exam points).
Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R-squared of the model.
summary(model2)
##
## Call:
## lm(formula = points ~ attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Description: The resulting model shows that attitude is a meaningful predictor of exam points. Specifically:
- a. For each one-unit increase in attitude, exam points increase by about 3.5 on average.
- b. The intercept of 11.637 is the expected exam score at an attitude of zero (an extrapolation, since observed attitudes range from 1.4 to 5).
- c. The model explains a fairly small proportion (about 19%) of the variance in exam points. In other words, attitude alone is not a sufficient predictor of exam points, even though its influence should be recognized; random error and other factors also play a role.
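As a quick check on interpretation (a) and (b), the reported coefficients can be turned into predictions by hand (the numbers are the intercept and slope from the model summary above):

```r
# A minimal sketch of how the fitted coefficients translate into predictions.
# b0 and b1 are the intercept and slope reported in summary(model2).
b0 <- 11.6372
b1 <- 3.5255
b0 + b1 * c(2, 3, 4)   # expected points at attitudes 2, 3, 4: ~18.7, 22.2, 25.7
# With the model object available, predict() gives the same numbers:
# predict(model2, newdata = data.frame(attitude = c(2, 3, 4)))
```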
Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage. Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots.
par(mfrow = c(2,2))
plot(model2, which = c(1,2,5))
Description:
a. The Residuals vs Fitted plot shows the points scattered randomly around the dotted line at y = 0, and the fitted (red) line is roughly horizontal without distinct patterns or trends, supporting a linear relationship. The linearity assumption of linear regression therefore appears satisfied.
b. The QQ plot shows that most points lie on the dashed straight line, except at the lower and upper ends, where some points deviate from it, suggesting the residual distribution might be slightly skewed. Nevertheless, the normality assumption can be considered approximately met, since with a large sample this assumption is almost never perfectly satisfied.
c. The Residuals vs Leverage plot shows no observations with large Cook's distances. If such outlying, high-leverage observations did appear, it would be worth considering removing them from the data, because they would have a strong influence on the linear model.
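The Cook's distances behind plot (c) can also be inspected numerically. The sketch below uses R's built-in `cars` data as a stand-in so the block runs on its own; substituting `model2` for `fit` checks the actual model:

```r
# Sketch: a numeric companion to the Residuals vs Leverage plot.
# cooks.distance() works on any fitted lm; `cars` is only a stand-in here.
fit <- lm(dist ~ speed, data = cars)
cooksd <- cooks.distance(fit)
# A common rule of thumb flags observations with Cook's distance above 4/n
which(cooksd > 4 / length(cooksd))
max(cooksd)   # values well below 1 suggest no single dominant observation
```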
date()
## [1] "Sun Nov 26 15:23:07 2023"
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
# read the data
alc <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", sep=",", header=TRUE)
# print out the names of the variables
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
dim(alc)
## [1] 370 35
glimpse(alc)
## Rows: 370
## Columns: 35
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
## $ Fjob <chr> "teacher", "other", "other", "services", "other", "other", …
## $ reason <chr> "course", "course", "other", "home", "home", "reputation", …
## $ guardian <chr> "mother", "father", "mother", "mother", "father", "mother",…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ paid <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", …
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5,…
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16,…
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16…
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 1…
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0,…
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
This data set contains background, achievement, and alcohol-consumption information on 370 students in two Portuguese schools. There are 35 variables, including student grades, demographic, social, and school-related features, as well as performance in Mathematics (mat) and Portuguese language (por). The data were collected using school reports and questionnaires.
The purpose of your analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, choose 4 interesting variables in the data and for each of them, present your personal hypothesis about their relationships with alcohol consumption.
Based on my everyday observations and some research reports about teenagers' alcohol consumption, I chose family relationships (“famrel”), number of school absences (“absences”), weekly study time (“studytime”), and frequency of going out with friends (“goout”) as candidate predictors of alcohol consumption. Hypothesis: a student belongs to the high-alcohol-consumption group if he or she (1) has a low-quality family relationship; (2) is absent from school more often; (3) studies less per week; and (4) goes out with friends more often.
Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption (use for example cross-tabulations, bar plots and box plots). Comment on your findings and compare the results of your exploration to your previously stated hypotheses.
library(ggplot2)
library(tidyr)
alc |>
select(absences, studytime, famrel, goout) |>
pivot_longer(everything(),
names_to = "variable",
values_to = "value") |>
ggplot(aes(x = value))+
facet_wrap(~variable, scales = "free")+
geom_bar()+
labs(title = "Distribution of the interested variables",
x = "Values of each variable",
y = "Frequency")
# The relationship between students' absences of class and alcohol consumption (box plot for numerical~categorical variables)
p1 <- alc |>
ggplot(aes(x = high_use, y = absences)) +
geom_boxplot() +
geom_jitter(width=0.25, alpha=0.5)+
labs(x = "Alcohol consumption",
y = "Frequency of class absences",
title =
"Frequency of class absences and alcohol consumption")+
scale_x_discrete(labels = c("FALSE" = "Non-high-user",
"TRUE" = "high-user"))
p1
# The relationship between quality of family relation and alcohol consumption (bar plot for categorical variables)
p2 <- alc |>
ggplot(aes(x = factor(famrel), fill = high_use)) +
geom_bar(position = "fill", color = "black") +
labs(x = "Quality of family relationships",
y = "Proportion of alcohol high-users",
title =
"Quality of family relationships and alcohol consumption")+
guides(fill=guide_legend("Alcohol consumption"))+
scale_fill_discrete(labels = c("FALSE" = "Non-high-user",
"TRUE" = "high-user"))
p2
# The relationship between going out with friends and alcohol consumption (bar plot for categorical variables)
p3 <- alc |>
ggplot(aes(x = factor(goout), fill = high_use)) +
geom_bar(position = "fill", color = "black") +
labs(x = "Going out with friends",
y = "Proportion of alcohol high-users",
title =
"Going out with friends and alcohol consumption")+
guides(fill=guide_legend("Alcohol consumption"))+
scale_fill_discrete(labels = c("FALSE" = "Non-high-user",
"TRUE" = "high-user"))
p3
# The relationship between weekly study time and alcohol consumption (bar plot for categorical variables)
p4 <- alc |>
ggplot(aes(x = factor(studytime), fill = high_use)) +
geom_bar(position = "fill", color = "black") +
labs(x = "Weekly study time",
y = "Proportion of alcohol high-users",
title =
"Weekly study time and alcohol consumption")+
guides(fill=guide_legend("Alcohol consumption"))+
scale_fill_discrete(labels = c("FALSE" = "Non-high-user",
"TRUE" = "high-user"))
p4
# combining four plots together
library(patchwork)
p1 + p2 + p3 + p4
Four plots explore the relationships between the chosen variables and alcohol consumption. (1) The box plot shows that the frequency and median (Q1, Q3) of class absences differ greatly between alcohol high-users and non-high-users: students with high alcohol use have more class absences, as hypothesized. (2) The first bar plot shows a result that differs from the hypothesis on family relationships: the students with the worst family relationships are not the highest-consumption group; rather, the highest consumers are those with bad to middling family relationships. (3) The second bar plot shows that the more often students go out with friends, the higher their alcohol consumption, as hypothesized. (4) The third bar plot shows that the more time students spend studying each week, the lower their alcohol consumption, as hypothesized.
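The exercise also suggests cross-tabulations; these complement the proportion bar plots. The tiny data frame below is a made-up stand-in so the sketch runs on its own; replacing `toy` with `alc` (keeping the same two table lines) tabulates the real data:

```r
# Sketch: share of alcohol high-users within each level of goout.
# `toy` is invented illustration data, not the real alc data set.
toy <- data.frame(goout    = c(1, 1, 2, 2, 3, 3, 3),
                  high_use = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE))
tab <- table(goout = toy$goout, high_use = toy$high_use)
round(prop.table(tab, margin = 1), 2)   # row-wise proportions per goout level
```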
Use logistic regression to statistically explore the relationship between your chosen variables and the binary high/low alcohol consumption variable as the target variable. Present and interpret a summary of the fitted model. Present and interpret the coefficients of the model as odds ratios and provide confidence intervals for them. Interpret the results and compare them to your previously stated hypothesis. Hint: If your model includes factor variables, see for example the RHDS book or the first answer of this stack exchange thread on how R treats and how you should interpret these variables in the model output (or use some other resource to study this).
# Fit the model based on the original hypothesis
model <- glm(high_use ~ absences + famrel + goout + studytime, data = alc, family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ absences + famrel + goout + studytime,
## family = "binomial", data = alc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.15221 0.71444 -1.613 0.106798
## absences 0.06655 0.02192 3.036 0.002394 **
## famrel -0.35898 0.13796 -2.602 0.009266 **
## goout 0.75990 0.12143 6.258 3.91e-10 ***
## studytime -0.57194 0.16963 -3.372 0.000747 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 370.36 on 365 degrees of freedom
## AIC: 380.36
##
## Number of Fisher Scoring iterations: 4
# Coefficients and profile-likelihood confidence intervals.
# Note: coef() returns estimates on the log-odds scale; wrap both in exp()
# to convert them to odds ratios.
OR <- coef(model)
CI <- confint(model)
## Waiting for profiling to be done...
ORCI <- cbind(OR, CI)
print(ORCI, digits = 1)
## OR 2.5 % 97.5 %
## (Intercept) -1.15 -2.57 0.24
## absences 0.07 0.02 0.11
## famrel -0.36 -0.63 -0.09
## goout 0.76 0.53 1.01
## studytime -0.57 -0.92 -0.25
All four hypothesized predictors are statistically significant in the model (p < 0.01), supporting their role in predicting alcohol consumption. Note that the table above shows the raw coefficients, which are on the log-odds scale; exponentiating them gives odds ratios. (1) Absences: each additional class absence increases the log-odds of being a high-user by 0.067 (95% CI: 0.02 to 0.11), multiplying the odds by about exp(0.067) ≈ 1.07. (2) Quality of family relationships: each one-unit improvement decreases the log-odds by 0.36 (95% CI: -0.63 to -0.09), an odds ratio of about 0.70. (3) Going out with friends: each additional level increases the log-odds by 0.76 (95% CI: 0.53 to 1.01), an odds ratio of about 2.14. (4) Weekly study time: each additional level decreases the log-odds by 0.57 (95% CI: -0.92 to -0.25), an odds ratio of about 0.56. The directions of all four effects are consistent with the stated hypotheses.
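Since `coef()` and `confint()` report estimates on the log-odds scale, a quick `exp()` converts them to odds ratios. The sketch below uses a toy logistic model on the built-in mtcars data (a stand-in so the block runs on its own); the same two calls apply to the `model` object fitted above:

```r
# Sketch: converting logistic-regression output from log-odds to odds ratios.
# `toy` on mtcars stands in for the `model` object fitted above.
toy <- glm(am ~ wt + hp, data = mtcars, family = "binomial")
exp(coef(toy))      # odds ratios
exp(confint(toy))   # 95% confidence intervals on the odds-ratio scale
# For the table above: exp(0.06655) is about 1.07, i.e. each additional
# class absence multiplies the odds of high alcohol use by roughly 1.07.
```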
Using the variables which, according to your logistic regression model, had a statistical relationship with high/low alcohol consumption, explore the predictive power of your model. Provide a 2x2 cross tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions. Compute the total proportion of inaccurately classified individuals (= the training error) and comment on all the results. Compare the performance of the model with performance achieved by some simple guessing strategy.
# Explore the predictive power of the model
probabilities <- predict(model, type = "response")
# Add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# Use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# See the last ten original classes, predicted probabilities, and class predictions
select(alc, absences, famrel, goout, studytime, high_use, probability, prediction) %>% tail(10)
## absences famrel goout studytime high_use probability prediction
## 361 3 4 3 2 FALSE 0.22224385 FALSE
## 362 0 4 2 1 FALSE 0.16242879 FALSE
## 363 7 5 3 1 TRUE 0.31573116 FALSE
## 364 1 5 3 3 FALSE 0.08975198 FALSE
## 365 6 4 3 1 FALSE 0.38200795 FALSE
## 366 2 5 2 3 FALSE 0.04697548 FALSE
## 367 2 4 4 2 FALSE 0.36371174 FALSE
## 368 3 1 1 2 FALSE 0.15505370 FALSE
## 369 4 2 5 1 TRUE 0.83529411 TRUE
## 370 2 4 1 1 TRUE 0.09388808 FALSE
# Tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)%>%
addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 233 26 259
## TRUE 63 48 111
## Sum 296 74 370
# Display a graphic visualizing actual high_use and predictions
p5 <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))+
geom_point()
p5
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)%>%
prop.table()%>%
addmargins()%>%
print(digits = 2)
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.63 0.07 0.70
## TRUE 0.17 0.13 0.30
## Sum 0.80 0.20 1.00
Among the 259 participants who are not alcohol high-users, the model correctly predicts 233 (90%). Among the 111 participants who are high-users, it correctly predicts only 48 (43%). In all, 89 of the 370 predictions (24%) are inaccurate; this is the training error. For comparison, a simple strategy of always guessing “non-high-user” would misclassify all 111 high-users, a 30% error rate, so the model improves on naive guessing.
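The rates above can be computed directly from the 2x2 cross-tabulation (numbers copied from the table printed earlier):

```r
# Sketch: summary rates from the confusion matrix
# (rows: actual high_use; columns: prediction).
cm <- matrix(c(233, 63, 26, 48), nrow = 2,
             dimnames = list(high_use = c("FALSE", "TRUE"),
                             prediction = c("FALSE", "TRUE")))
specificity <- cm["FALSE", "FALSE"] / sum(cm["FALSE", ])   # 233/259, ~0.90
sensitivity <- cm["TRUE", "TRUE"] / sum(cm["TRUE", ])      # 48/111, ~0.43
training_error <- (cm["FALSE", "TRUE"] + cm["TRUE", "FALSE"]) / sum(cm)  # ~0.24
round(c(specificity = specificity, sensitivity = sensitivity,
        training_error = training_error), 2)
```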
Bonus: Perform 10-fold cross-validation on your model. Does your model have better test set performance (smaller prediction error using 10-fold cross-validation) compared to the model introduced in the Exercise Set (which had about 0.26 error). Could you find such a model?
# Define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# Call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2405405
# K-fold cross-validation
library(boot)
set.seed(1)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model, K = 10)
# average proportion of wrong predictions in the cross-validation
delta.percent <- paste0((cv$delta[1] |> round(4)) * 100, "%")
delta.percent
The training error rate is about 24%, already below the roughly 26% error of the model introduced in Exercise Set 3. According to the 10-fold cross-validation, the model has an average test error of 23.51%, slightly lower than the training error.
Perform cross-validation to compare the performance of different logistic regression models (= different sets of predictors). Start with a very high number of predictors and explore the changes in the training and testing errors as you move to models with less predictors. Draw a graph displaying the trends of both training and testing errors by the number of predictors in the model.
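A sketch of this comparison loop is below. To keep it self-contained it uses the built-in mtcars data with `am` as the binary target and an arbitrary predictor list; these are stand-ins, and the real exercise would substitute `alc`, `high_use`, and predictors from the data:

```r
# Sketch: train vs 10-fold CV error as predictors are added one at a time
# (equivalently, read the table bottom-up for "removing" predictors).
library(boot)
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)
predictors <- c("wt", "hp", "disp", "qsec")   # illustrative stand-in list
errors <- data.frame(n_predictors = seq_along(predictors), train = NA, test = NA)
set.seed(1)
for (k in seq_along(predictors)) {
  fit <- glm(reformulate(predictors[1:k], response = "am"),
             data = mtcars, family = "binomial")
  errors$train[k] <- loss_func(mtcars$am, predict(fit, type = "response"))
  errors$test[k]  <- cv.glm(mtcars, cost = loss_func, glmfit = fit, K = 10)$delta[1]
}
errors   # plot `train` and `test` with two geom_line() calls to see the trends
```

Training error typically keeps falling as predictors are added, while the cross-validated test error need not, which is the point of the comparison.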
date()
## [1] "Sun Nov 26 15:23:08 2023"
Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:patchwork':
##
## area
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The data set describes housing values in suburbs of Boston. There are 506 observations of 14 variables (numeric and integer). The 14 variables are:
1. crim: per capita crime rate by town.
2. zn: proportion of residential land zoned for lots over 25,000 sq.ft.
3. indus: proportion of non-retail business acres per town.
4. chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
5. nox: nitrogen oxides concentration (parts per 10 million).
6. rm: average number of rooms per dwelling.
7. age: proportion of owner-occupied units built prior to 1940.
8. dis: weighted mean of distances to five Boston employment centres.
9. rad: index of accessibility to radial highways.
10. tax: full-value property-tax rate per $10,000.
11. ptratio: pupil-teacher ratio by town.
12. black: 1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town.
13. lstat: lower status of the population (percent).
14. medv: median value of owner-occupied homes in $1000s.
Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.
# Show summaries of the variables
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# First make a distribution plot for 14 variables
library(ggplot2)
library(tidyr)
# Convert the data to long format for easier plotting
long_boston <- gather(Boston)
# Plotting
p1 <- ggplot(long_boston, aes(x = value)) +
geom_density(fill = "skyblue", color = "black") +
facet_wrap(~key, scales = "free") +
theme_minimal() +
labs(title = "Overview of Boston dataset")
# Print the plot
p1
# Then make a correlations plot to look at the correlations among the 14 variables in the data
library(corrplot)
## corrplot 0.92 loaded
# Calculate the correlation matrix and round it
cor_matrix <- cor(Boston) |>
round(digits = 2)
# Print the correlation matrix
print(cor_matrix)
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# Visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper")
From the distribution plots, most of the variables are skewed to the right or to the left; only rm is approximately normally distributed. From the correlation matrix, the strongest correlations are between rad and tax (positive), dis and age (negative), dis and nox (negative), dis and indus (negative), and lstat and medv (negative).
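As a cross-check on the pairs listed above, the strongest correlations can also be extracted programmatically instead of being read off the printed matrix. A minimal sketch, assuming only base R and the MASS package:

```r
library(MASS)  # provides the Boston dataset
cor_matrix <- round(cor(Boston), 2)
# Keep only the upper triangle so each pair appears once
cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- NA
cor_long <- as.data.frame(as.table(cor_matrix))
cor_long <- cor_long[!is.na(cor_long$Freq), ]
# Sort by absolute correlation and show the strongest pairs
head(cor_long[order(-abs(cor_long$Freq)), ], 5)
```

The top of this list should match the pairs named in the text, with rad–tax (0.91) strongest.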
Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.
# Center and standardize variables
boston_scaled <- scale(Boston) |>
as.data.frame()
# Summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
After scaling the data, the mean of every variable becomes zero. Most values fall between -4 and 4, the exception being crim (crime rate), which still has a long right tail.
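The scaled summaries above follow directly from what `scale()` does. A quick sketch (assuming base R only) verifying that `scale()` applies (x - mean(x)) / sd(x) column-wise, which is why every scaled mean is 0 and the units become standard deviations:

```r
library(MASS)  # provides the Boston dataset
# Standardize one column by hand and compare to scale()
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
scaled <- scale(Boston)[, "crim"]
all.equal(unname(manual), unname(scaled))  # TRUE
```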
# Summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# Create a quantile vector of crim
bins <- quantile(boston_scaled$crim)
# Create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
labels = c("low", "med_low","med_high", "high"))
# Look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# Remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# Add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
library(dplyr)
# Number of rows in the Boston dataset
n <- nrow(boston_scaled)
# Choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# Create train set
train <- boston_scaled[ind,]
# Create test set
test <- boston_scaled[-ind,]
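The `ind` / `-ind` indexing above partitions the rows without overlap. A minimal sketch of the same split pattern on a toy data frame (the seed and the 10-row frame are illustrative assumptions, not part of the exercise):

```r
set.seed(1)
toy <- data.frame(id = 1:10)
n <- nrow(toy)
# Choose randomly 80% of the row indices
ind <- sample(n, size = n * 0.8)
toy_train <- toy[ind, , drop = FALSE]
toy_test  <- toy[-ind, , drop = FALSE]
nrow(toy_train)                       # 8
nrow(toy_test)                        # 2
intersect(toy_train$id, toy_test$id)  # integer(0): no row in both sets
```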
Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.
# Fit the linear discriminant analysis
lda.fit <- lda(crime~., data = train)
# Print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2549505 0.2450495 0.2400990 0.2599010
##
## Group means:
## zn indus chas nox rm age
## low 0.94768689 -0.9154164 -0.15765625 -0.8664804 0.49324129 -0.8330934
## med_low -0.06128399 -0.2905159 -0.03371693 -0.5543909 -0.09719511 -0.3794097
## med_high -0.37054365 0.2063740 0.05238023 0.3768264 0.15156095 0.3976410
## high -0.48724019 1.0149946 -0.04735191 1.0267822 -0.39946717 0.8122921
## dis rad tax ptratio black lstat
## low 0.8750206 -0.6897369 -0.7780503 -0.4386754 0.38100748 -0.767876258
## med_low 0.3826266 -0.5456857 -0.4654718 -0.1017024 0.31146285 -0.162813217
## med_high -0.4097682 -0.3756701 -0.2603363 -0.2604135 0.09270391 0.006875968
## high -0.8535887 1.6596029 1.5294129 0.8057784 -0.76867030 0.834420910
## medv
## low 0.5582392
## med_low 0.0121384
## med_high 0.2007779
## high -0.6528397
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11885002 0.66122613 -0.98747678
## indus -0.04130912 -0.23640126 0.23839579
## chas -0.07358997 -0.01543395 0.21302462
## nox 0.33296411 -0.76745164 -1.35199275
## rm -0.08676976 -0.07017084 -0.15558456
## age 0.29552614 -0.15968266 -0.24988799
## dis -0.11193174 -0.11699745 0.17008638
## rad 3.12327648 1.09445683 -0.22148103
## tax 0.10112350 -0.22764844 0.78165323
## ptratio 0.14306963 0.10445909 -0.33126409
## black -0.15592597 -0.01390650 0.09298486
## lstat 0.17766584 -0.24785436 0.32723953
## medv 0.17126083 -0.36432554 -0.26921116
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9537 0.0335 0.0128
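The "Proportion of trace" printed above can be reproduced from the model object: it is the squared singular values of the discriminants, normalized to sum to 1. A hedged sketch, fit on the full scaled data for self-containment (so the numbers will differ slightly from the train-set fit above):

```r
library(MASS)
boston_scaled <- as.data.frame(scale(Boston))
# Rebuild the categorical crime variable from quantiles
crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim),
include.lowest = TRUE,
labels = c("low", "med_low", "med_high", "high"))
d <- boston_scaled[, names(boston_scaled) != "crim"]
d$crime <- crime
fit <- lda(crime ~ ., data = d)
# Squared singular values, normalized: the proportion of trace
round(fit$svd^2 / sum(fit$svd^2), 4)
```

With four classes there are at most three discriminants, which is why the output has exactly LD1–LD3.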
# Draw the LDA (bi)plot
# The function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# Target classes as numeric
classes <- as.numeric(train$crime)
# Plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
A biplot based on LD1 and LD2 was generated. In the LDA biplot, the clusters of the low, med_low, and med_high classes separate poorly, with heavy overlap among the three. The high class separates well, although it still overlaps noticeably with med_high. Based on the arrows, the variable rad contributes most to separating the high class. The contributions of the variables to the other clusters are hard to judge because of the heavy overlap.
Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results.
# Save the correct classes from test data
correct_classes <- test$crime
# Remove the crime variable from test data
test <- dplyr::select(test, -crime)
# Predict classes with the LDA model on the test data
lda.pred <- predict(lda.fit, newdata = test)
# Cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 16 8 0 0
## med_low 5 16 6 0
## med_high 0 15 14 0
## high 0 0 1 21
The cross-tabulated results show that the high class is predicted almost perfectly (21 of 22 correct) and that the low and med_low classes are mostly correct. The med_high class, however, is predicted correctly in fewer than half of the cases (14 of 29), with most errors falling into med_low. Overall, the predictive performance of the linear discriminant analysis is only moderately satisfactory.
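The overall accuracy can be computed directly from the confusion matrix. A small sketch with the counts copied from the output above; the diagonal holds the correct predictions:

```r
conf <- matrix(c(16,  8,  0,  0,
5, 16,  6,  0,
0, 15, 14,  0,
0,  0,  1, 21),
nrow = 4, byrow = TRUE,
dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
predicted = c("low", "med_low", "med_high", "high")))
# Correct predictions divided by all predictions
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 2)  # 0.66
```

Note that the exact counts depend on the random train/test split, so a rerun without a fixed seed will give slightly different numbers.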
Reload the Boston dataset and standardize the dataset (we did not do this in the Exercise Set, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results.
data("Boston")
boston_scaled <- scale(Boston) |>
as.data.frame()
# Euclidean distances matrix
dist_eu <- dist(boston_scaled)
# Look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
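By default `dist()` uses the Euclidean metric. A hedged check (not part of the exercise) that recomputes the distance between the first two scaled observations by hand:

```r
library(MASS)
boston_scaled <- scale(Boston)
# Euclidean distance between rows 1 and 2, computed from the definition
d12 <- sqrt(sum((boston_scaled[1, ] - boston_scaled[2, ])^2))
# Same entry taken from the dist() object
all.equal(d12, as.matrix(dist(boston_scaled))[1, 2])  # TRUE
```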
set.seed(123)
k_max <- 10
# Calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# Visualize the results (qplot() is deprecated since ggplot2 3.4.0, so use ggplot() directly)
ggplot(data.frame(k = 1:k_max, twcss = twcss), aes(x = k, y = twcss)) +
geom_line()
The optimal number of clusters is where the total WCSS drops most sharply. In this case, two clusters seem optimal.
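The elbow heuristic can also be checked numerically. A hedged sketch (assuming the same seed and k range as above): the improvement in total WCSS gained by each additional cluster should be by far the largest for the first step if the elbow is at k = 2.

```r
library(MASS)
set.seed(123)
boston_scaled <- as.data.frame(scale(Boston))
twcss <- sapply(1:10, function(k) kmeans(boston_scaled, k)$tot.withinss)
# Improvement gained by each extra cluster, relative to the previous WCSS
drops <- -diff(twcss)
round(drops / twcss[-length(twcss)], 2)
```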
# K-means clustering with 2 clusters
km <- kmeans(boston_scaled, centers = 2)
# Plot the scaled Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
pairs(boston_scaled[1:7], col = km$cluster)
pairs(boston_scaled[8:14], col = km$cluster)
Most of the variables act as reasonably good linear separators for the two clusters, with the exception of rad, ptratio, and tax.
Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points. Adjust the code: add argument color as a argument in the plot_ly() function. Set the color to be the crime classes of the train set. Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities?
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
# Install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
p3 <- plot_ly(x = matrix_product$LD1,
y = matrix_product$LD2,
z = matrix_product$LD3,
color = train$crime, #Set the color to be the crime classes of the train set.
type= 'scatter3d',
mode='markers',
size = 2)
p3
# Draw another 3D plot where the color is defined by the clusters of the k-means
# Get the clusters of k-means for the train set
train.km <- kmeans(model_predictors, centers = 2)
p4 <- plot_ly(x = matrix_product$LD1,
y = matrix_product$LD2,
z = matrix_product$LD3,
type= 'scatter3d',
mode='markers',
color = factor(train.km$cluster), #color defined by clusters of the k-means
size = 2)
p4
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
The LDA was trained on a categorical crime-rate variable with four classes defined by quantiles, while k = 2 was chosen for the k-means clustering based on the total within-cluster sum of squares. Since LDA is a supervised technique, we know what each category represents, and the classes are labeled in the legend. K-means is an unsupervised method, so the real-world meaning of the two clusters is not known in advance. Comparing the two 3D plots, however, it is interesting to see that cluster 2 from k-means overlaps nicely with the high category from LDA, while cluster 1 roughly covers the other three categories.
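The overlap described above can also be quantified with a cross-tabulation of the k-means clusters against the crime classes. A hedged sketch, run on the full scaled data rather than the train split (so the counts will differ somewhat from the plots, and the cluster numbering is arbitrary):

```r
library(MASS)
set.seed(123)
boston_scaled <- as.data.frame(scale(Boston))
# Rebuild the categorical crime variable from quantiles
crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim),
include.lowest = TRUE,
labels = c("low", "med_low", "med_high", "high"))
# Cluster on the predictors only, as in the 3D plot above
km <- kmeans(boston_scaled[, names(boston_scaled) != "crim"], centers = 2)
table(crime, cluster = km$cluster)
```

If the visual impression holds, one cluster's column should be dominated by the high class and the other should spread across the remaining three.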